
[TRTLLM-8952][feat] Support Multi-Node Disagg Perf Test in CI#9138

Merged
chenfeiz0326 merged 3 commits into NVIDIA:main from chenfeiz0326:chenfeiz/support-disagg-perf-test
Dec 26, 2025

Conversation


@chenfeiz0326 chenfeiz0326 commented Nov 13, 2025

Summary by CodeRabbit

Release Notes

  • New Features
    • Added multi-node disaggregated tests to the performance-regression CI system.
    • Added write_regressive_test_cases to gather regression test cases for pytest.
    • Use get_free_port to pick ports for the aggregated and disaggregated servers.
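The get_free_port change avoids hard-coded port collisions when aggregated and disaggregated servers run on the same node. A minimal sketch of such a helper, assuming the usual bind-to-port-0 technique (the PR's actual implementation may differ):

```python
import socket

def get_free_port() -> int:
    """Ask the OS for an ephemeral free port by binding to port 0.

    Illustrative sketch of the get_free_port helper named in the release
    notes, not the PR's exact code. Note there is an inherent race: the
    port can be taken between this call and the server actually binding it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))       # port 0 -> kernel picks a free port
        return s.getsockname()[1]      # report the port it chose
```

The returned port is then passed to the server command line instead of a fixed 8000-style default, so parallel ctx/gen/disagg servers on one host do not clash.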

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

@chenfeiz0326 chenfeiz0326 requested review from a team as code owners November 13, 2025 10:19
@chenfeiz0326 chenfeiz0326 requested review from kxdc and poweiw November 13, 2025 10:19

coderabbitai bot commented Nov 13, 2025

📝 Walkthrough

Walkthrough

This PR introduces multi-node disaggregated performance testing infrastructure for Slurm-based deployments. It adds a new Groovy orchestration method, four Slurm execution scripts (Bash/Python), refactors core test configuration handling to support aggregated and multi-node disaggregated server architectures, extends performance database utilities with GPU type filtering, and introduces new test configurations and execution parameters for multi-node GB200 scenarios.

Changes

Cohort / File(s) Change Summary
Jenkins Orchestration
jenkins/L0_Test.groovy
Added new method runLLMMultiNodeDisaggTestlistWithSbatch to launch and manage multi-node disaggregated Slurm tests, including file transfers, test rendering, and sbatch construction; updated routing in runLLMTestlistOnSlurm to invoke multi-node disagg path when nodeCount > 1 and stage contains "Perf-Sanity-Disagg"; expanded post-merge perf-sanity configs with multi-node variants.
Slurm Execution Scripts
jenkins/scripts/perf/disaggregated/{install.sh, slurm_exec.py, slurm_launch.sh, slurm_run.sh}
Added four new scripts: install.sh prepares compute nodes with Python/dependencies and TensorRT-LLM wheel; slurm_exec.py orchestrates SLURM job submission and monitoring via sbatch; slurm_launch.sh coordinates container deployment, server startup (ctx/gen), and disaggregated benchmark execution; slurm_run.sh handles pytest invocation, library path setup, and optional performance validation.
Performance Test Configuration
tests/integration/defs/perf/test_perf.py
Refactored ServerConfig and ClientConfig constructors to accept dictionaries; added disagg_configs field to PerfTestConfig; introduced new public methods get_trtllm_aggr_commands and get_trtllm_multi_node_disagg_commands; replaced server-only paths with aggregated and multi-node disaggregated workflows; extended to_cmd/to_db_data to accept hostname/port and include new fields (tp, ep, pp, gpus_per_node, etc.).
Performance Test Utilities
tests/integration/defs/perf/utils.py
Renamed PerfServerClientBenchmarkCmds to PerfAggrScriptTestCmds; introduced new PerfMultiNodeDisaggScriptTestCmds type with ctx/gen/disagg server commands and benchmark commands; updated endpoint health checks and timeout configurations; added socket/YAML imports for multi-node config generation.
Performance Database Utilities
tests/integration/defs/perf/open_search_db_utils.py
Added gpu_type filtering to query_history_data and get_history_data; replaced field-skipping logic to exclude @, _, and ts_ prefixed fields; changed baseline tracking from a single value to a list per cmd_idx; added regression awareness (b_is_regression == False constraint); removed SERVER_FIELDS and CLIENT_FIELDS constants.
Perf-Sanity Configurations
tests/scripts/perf-sanity/{l0_dgx_b200.yaml, l0_dgx_b300.yaml, l0_gb200_multi_nodes.yaml, l0_gb200_multi_nodes_disagg.yaml}
Added/restructured server configurations with nested blocks (attention_dp_config, moe_config, cuda_graph_config, kv_cache_config, speculative_config); added multi-node GB200 configs (both aggregated and disaggregated); updated client configs with new naming and parameter adjustments; introduced disagg_configs block for multi-node disaggregated benchmarks.
Test Lists
tests/integration/test_lists/test-db/{perf_sanity_l0_dgx_b200.yml, perf_sanity_l0_gb200_multi_nodes.yml}
Expanded DGX B200 test matrix with pre/post-merge variants for multi-parameter configs (fp4, tensor parallel, expert parallel); added new multi-node GB200 test entries with 8-GPU, 2-node configuration targeting pre/post-merge stages.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Jenkins as Jenkins<br/>(L0_Test.groovy)
    participant PyOrch as Python<br/>(slurm_exec.py)
    participant Slurm as Slurm sbatch
    participant Launch as Slurm Launch Script<br/>(slurm_launch.sh)
    participant Compute as Compute Nodes
    participant Run as Test Runner<br/>(slurm_run.sh)

    User->>Jenkins: Trigger multi-node disagg test
    Jenkins->>Jenkins: runLLMMultiNodeDisaggTestlistWithSbatch()
    Jenkins->>Jenkins: Prepare test files & configs
    
    Jenkins->>PyOrch: Execute slurm_exec.py
    PyOrch->>PyOrch: Parse YAML, compute topology<br/>(ctx_servers, gen_servers, total_nodes)
    PyOrch->>Slurm: Submit sbatch job
    Slurm->>Launch: Allocate nodes & spawn srun
    
    Launch->>Compute: install.sh on each node
    Compute->>Compute: Setup Python, install wheel
    
    Launch->>Compute: Start gen servers (srun)
    Launch->>Compute: Start ctx servers (srun)
    Launch->>Compute: Start disagg server + benchmark (srun)
    
    Compute->>Run: Execute pytest via slurm_run.sh
    Run->>Run: Set LD_LIBRARY_PATH<br/>Inject TensorRT wheel
    Run->>Compute: Run performance tests
    
    PyOrch->>Slurm: Poll job status via squeue
    Slurm-->>PyOrch: Job completion
    PyOrch->>PyOrch: Check exit code via sacct
    PyOrch-->>Jenkins: Return status
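The final step in the diagram, "Check exit code via sacct", relies on sacct's ExitCode field, which reports each job step as "return:signal" (e.g. "0:0", "1:0"). A sketch of the parsing such a check needs; the helper name and the 128+signal convention are illustrative, not necessarily what slurm_exec.py does:

```python
def parse_sacct_exit_code(sacct_output: str) -> int:
    """Reduce `sacct -j <id> --format=ExitCode -n -P` output to one exit code.

    Each line is "ret:sig"; the job failed if any step returned nonzero or
    was killed by a signal (mapped here to the shell convention 128+sig).
    """
    worst = 0
    for line in sacct_output.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blanks and malformed rows
        ret_s, sig_s = line.split(":", 1)
        try:
            ret, sig = int(ret_s), int(sig_s)
        except ValueError:
            continue
        if sig != 0:
            worst = max(worst, 128 + sig)
        worst = max(worst, ret)
    return worst
```

The orchestrator would then propagate a nonzero result back to Jenkins so the stage is marked failed even though sbatch itself submitted successfully.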
sequenceDiagram
    participant Config as Config YAML
    participant TestPerf as test_perf.py
    participant Parser as Config Parser
    participant Commands as Command Gen

    Config->>Parser: parse_aggr_config_file() / parse_multi_node_disagg_config_file()
    
    alt Aggregated Server Path
        Parser->>TestPerf: ServerConfig(server_config_dict)
        Parser->>TestPerf: ClientConfig(client_config_dict)
        TestPerf->>Commands: get_trtllm_aggr_commands()
        Commands-->>TestPerf: [server_cmds, client_cmds]
    else Multi-Node Disaggregated Path
        Parser->>TestPerf: disagg_configs[] + ServerConfig
        TestPerf->>Commands: get_trtllm_multi_node_disagg_commands()
        Commands-->>TestPerf: PerfMultiNodeDisaggScriptTestCmds<br/>(ctx_cmds, gen_cmds, disagg_cmds, benchmark_cmds)
    end
    
    TestPerf->>TestPerf: set_aggr_server_configs() / set_multi_node_disagg_server_configs()
    TestPerf-->>TestPerf: Update PerfTestConfig with commands
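The ServerConfig(server_config_dict) step above reflects the PR's refactor from explicit constructor parameters to dictionary-based initialization. A minimal sketch of that pattern, with defaults so YAML omissions do not raise KeyError (field names here are illustrative; the real ServerConfig in test_perf.py has many more fields):

```python
class ServerConfig:
    """Illustrative dict-driven config constructor, not the PR's exact class."""

    def __init__(self, cfg: dict):
        # Flat fields with defaults mirroring the new parallelism knobs.
        self.name = cfg.get("name", "server")
        self.tp = cfg.get("tp", 1)
        self.ep = cfg.get("ep", 1)
        self.pp = cfg.get("pp", 1)
        self.gpus_per_node = cfg.get("gpus_per_node", 8)
        # Nested blocks arrive as sub-dicts (kv_cache_config, moe_config, ...).
        kv = cfg.get("kv_cache_config") or {}
        self.kv_cache_free_gpu_mem_fraction = kv.get(
            "free_gpu_memory_fraction", 0.9)

ctx_cfg = ServerConfig({
    "name": "ctx",
    "tp": 4,
    "kv_cache_config": {"free_gpu_memory_fraction": 0.8},
})
```

Keeping the constructor dict-based means new YAML fields only need one extraction line here, instead of a signature change rippling through every call site.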

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Key areas requiring attention:
    • test_perf.py ServerConfig/ClientConfig constructor refactoring from explicit parameters to dictionary-based initialization; validate all config field extractions and nested config structure parsing
    • Multi-node disaggregated command generation logic in get_trtllm_multi_node_disagg_commands and related topology computation (ctx/gen server counts, node distribution)
    • Slurm script orchestration and error handling flow across four scripts (install.sh, slurm_exec.py, slurm_launch.sh, slurm_run.sh); validate task allocation, signal propagation, and cleanup logic
    • Database query and upload changes in open_search_db_utils.py with gpu_type filtering and regression awareness; verify regression flag logic and baseline selection
    • Configuration YAML schema modernization and nested config block structure; ensure all new fields are correctly parsed and propagated through command generation
    • Test list and configuration parameter matrix consistency across pre/post-merge variants

Possibly related PRs

Suggested reviewers

  • Tabrizian
  • Superjomn
  • SimengLiu-nv
  • chzblych
  • nvchenghaoz

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.08% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description check ⚠️ Warning The PR description is entirely empty: all required sections (Description, Test Coverage) are blank, with only template placeholders remaining. Fill in the Description section with what changes are being made and why, provide a Test Coverage section listing the relevant tests that safeguard the changes, and complete the other checklist items as applicable.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main feature: multi-node disaggregated performance testing support in CI, with appropriate JIRA ticket reference and [feat] type indicator.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/defs/perf/utils.py (1)

276-339: Do not pass env to defs.trt_test_alternative.check_output.

check_output from defs.trt_test_alternative only accepts a command argument (see Line 58 in tests/integration/defs/perf/misc.py). Passing env=… here throws a TypeError, so aggregated/disagg perf runs will fail before the benchmark starts. Please switch these call sites (Line 276 for the client path, Lines 338-339 for the disagg path, and the multi-node branch later in this file) to subprocess.check_output(...).decode() or another helper that supports env.

-                output += check_output(self.client_cmds[cmd_idx],
-                                       env=venv._new_env)
+                output_bytes = subprocess.check_output(
+                    self.client_cmds[cmd_idx], env=venv._new_env
+                )
+                output += output_bytes.decode()
🧹 Nitpick comments (9)
jenkins/scripts/perf/disaggregated/install.sh (2)

13-13: Consider running apt-get update before installing packages.

The apt-get install -y libffi-dev command might fail if the package lists are stale. Consider adding apt-get update -y before the install command to ensure the latest package information is available.

Apply this diff:

 python3 --version
-apt-get install -y libffi-dev
+apt-get update -y && apt-get install -y libffi-dev
 nvidia-smi && nvidia-smi -q && nvidia-smi topo -m

9-10: Add explicit error messages for download and extraction failures.

While set -Eeuo pipefail will cause the script to exit on failure, adding explicit error checking with informative messages would improve debuggability when tarball downloads or extractions fail.

Apply this diff:

-wget -nv $llmTarfile
-tar -zxf $tarName
+wget -nv $llmTarfile || { echo "Error: Failed to download $llmTarfile"; exit 1; }
+tar -zxf $tarName || { echo "Error: Failed to extract $tarName"; exit 1; }
 which python3
jenkins/scripts/perf/disaggregated/slurm_exec.py (1)

178-186: Consider more specific exception handling.

The broad except Exception as e: clause on line 178 catches all exceptions, which could potentially mask programming errors. Consider catching specific exceptions or at least logging more details about unexpected exceptions.

Apply this diff to improve error visibility:

-    except Exception as e:
+    except (OSError, subprocess.SubprocessError) as e:
         print(f"Error during job monitoring: {e}")
+        import traceback
+        traceback.print_exc()
         tail_proc.terminate()
tests/integration/defs/perf/test_perf.py (6)

570-576: Add else clause for unexpected types in eagle3_layers_to_capture.

The code handles int and list types but silently defaults to an empty list for any other type. Consider adding logging or an assertion to catch unexpected types during development.

Apply this diff to add defensive handling:

         eagle3_value = speculative_config.get('eagle3_layers_to_capture', [])
         if isinstance(eagle3_value, int):
             self.eagle3_layers_to_capture = [eagle3_value]
         elif isinstance(eagle3_value, list):
             self.eagle3_layers_to_capture = eagle3_value
         else:
+            print_warning(f"Unexpected type for eagle3_layers_to_capture: {type(eagle3_value)}, defaulting to empty list")
             self.eagle3_layers_to_capture = []

721-747: Consider removing unused parameter if truly unnecessary.

Static analysis flags working_dir as unused. If it's not needed for future extensions, consider removing it to reduce confusion.


808-808: Add explicit Optional type hint.

Static analysis recommends using explicit Optional[str] or str | None instead of implicit None default for the select_pattern parameter.

Apply this diff:

-def parse_aggr_config_file(config_file_path: str, select_pattern: str = None):
+def parse_aggr_config_file(config_file_path: str, select_pattern: str | None = None):

868-925: Add explicit Optional type hint and document port selection logic.

  1. Static analysis recommends explicit Optional[str] or str | None for the select_pattern parameter.
  2. The port selection logic (lines 877-879) is unclear: why does having both "CTX" and "GEN" in the index result in port 8336 vs 8333? Add a comment explaining this distinction.

Apply this diff:

-def parse_multi_node_disagg_config_file(config_file_path: str,
-                                        select_pattern: str = None):
+def parse_multi_node_disagg_config_file(config_file_path: str,
+                                        select_pattern: str | None = None):
     # Get disagg_server_idx from environment variable
     disagg_server_idx = os.environ.get("DISAGG_SERVER_IDX", "DISAGG")

     # Get hostname
     hostname = socket.gethostname()

-    # Determine port based on disagg_server_idx
+    # Determine port based on disagg_server_idx
+    # Worker nodes (CTX/GEN) use port 8336, orchestrator node uses 8333
     port = 8333
     if "CTX" in disagg_server_idx and "GEN" in disagg_server_idx:
         port = 8336

2499-2500: Add explicit Optional type hints.

Static analysis recommends explicit Optional or union with None for the optional parameters.

Apply this diff:

     def _get_metric_name(self,
                          metric_type: PerfMetricType,
                          bs: int = None,
                          input_len: int = None,
                          output_len: int = None,
                          server_name: str = None,
-                         client_name: str = None,
-                         disagg_config_name: str = None) -> str:
+                         client_name: str | None = None,
+                         disagg_config_name: str | None = None) -> str:

2542-2549: Consider extracting exception messages to constants.

Static analysis suggests avoiding long exception messages inline. While this is a minor style issue, extracting these to constants could improve maintainability.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48a27c7 and 934f0ea.

📒 Files selected for processing (14)
  • jenkins/L0_Test.groovy (3 hunks)
  • jenkins/scripts/perf/disaggregated/install.sh (1 hunks)
  • jenkins/scripts/perf/disaggregated/slurm_exec.py (1 hunks)
  • jenkins/scripts/perf/disaggregated/slurm_launch.sh (1 hunks)
  • jenkins/scripts/perf/disaggregated/slurm_run.sh (1 hunks)
  • tests/integration/defs/perf/open_search_db_utils.py (6 hunks)
  • tests/integration/defs/perf/test_perf.py (30 hunks)
  • tests/integration/defs/perf/utils.py (8 hunks)
  • tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml (3 hunks)
  • tests/integration/test_lists/test-db/perf_sanity_l0_gb200_multi_nodes.yml (1 hunks)
  • tests/scripts/perf-sanity/l0_dgx_b200.yaml (1 hunks)
  • tests/scripts/perf-sanity/l0_dgx_b300.yaml (1 hunks)
  • tests/scripts/perf-sanity/l0_gb200_multi_nodes.yaml (1 hunks)
  • tests/scripts/perf-sanity/l0_gb200_multi_nodes_disagg.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (10)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/scripts/perf-sanity/l0_dgx_b300.yaml
  • tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml
  • jenkins/scripts/perf/disaggregated/slurm_run.sh
  • tests/scripts/perf-sanity/l0_gb200_multi_nodes.yaml
  • tests/integration/test_lists/test-db/perf_sanity_l0_gb200_multi_nodes.yml
  • jenkins/L0_Test.groovy
  • tests/scripts/perf-sanity/l0_dgx_b200.yaml
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • jenkins/scripts/perf/disaggregated/install.sh
  • jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • jenkins/scripts/perf/disaggregated/install.sh
  • jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • jenkins/scripts/perf/disaggregated/install.sh
📚 Learning: 2025-08-18T09:08:07.687Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 6984
File: cpp/tensorrt_llm/CMakeLists.txt:297-299
Timestamp: 2025-08-18T09:08:07.687Z
Learning: In the TensorRT-LLM project, artifacts are manually copied rather than installed via `cmake --install`, so INSTALL_RPATH properties are not needed - only BUILD_RPATH affects the final artifacts.

Applied to files:

  • jenkins/scripts/perf/disaggregated/install.sh
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • jenkins/scripts/perf/disaggregated/install.sh
  • tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml
  • jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • jenkins/scripts/perf/disaggregated/install.sh
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml
  • jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/integration/defs/perf/test_perf.py
🧬 Code graph analysis (3)
jenkins/scripts/perf/disaggregated/slurm_run.sh (1)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
  • env (61-68)
tests/integration/defs/perf/utils.py (2)
tests/integration/defs/perf/misc.py (1)
  • check_output (59-69)
tests/integration/defs/trt_test_alternative.py (4)
  • popen (199-218)
  • print_error (318-324)
  • print_info (300-306)
  • print_warning (309-315)
tests/integration/defs/perf/test_perf.py (1)
tests/integration/defs/perf/utils.py (7)
  • AbstractPerfScriptTestClass (579-894)
  • PerfAggrScriptTestCmds (240-288)
  • PerfBenchScriptTestCmds (133-237)
  • PerfDisaggScriptTestCmds (291-350)
  • PerfMetricType (89-119)
  • PerfMultiNodeDisaggScriptTestCmds (353-576)
  • generate_test_nodes (950-984)
🪛 Ruff (0.14.4)
jenkins/scripts/perf/disaggregated/slurm_exec.py

1-1: Shebang is present but file is not executable

(EXE001)


103-103: subprocess call with shell=True identified, security issue

(S602)


127-127: subprocess call: check for execution of untrusted input

(S603)


127-127: Starting a process with a partial executable path

(S607)


128-128: subprocess call: check for execution of untrusted input

(S603)


129-129: Starting a process with a partial executable path

(S607)


139-139: subprocess call with shell=True identified, security issue

(S602)


157-157: subprocess call with shell=True identified, security issue

(S602)


178-178: Do not catch blind exception: Exception

(BLE001)

tests/integration/defs/perf/open_search_db_utils.py

362-363: Avoid equality comparisons to True; use ...: for truth checks

Replace comparison

(E712)

tests/integration/defs/perf/utils.py

251-251: Probable use of requests call without timeout

(S113)


254-254: Do not catch blind exception: Exception

(BLE001)


275-275: f-string without any placeholders

Remove extraneous f prefix

(F541)


377-378: Avoid specifying long messages outside the exception class

(TRY003)


416-416: Undefined name num_ctx_servers

(F821)


420-420: Undefined name num_gen_servers

(F821)


437-437: Probable use of requests call without timeout

(S113)


440-440: Do not catch blind exception: Exception

(BLE001)


484-484: Function call with shell=True parameter identified, security issue

(S604)


519-519: Function call with shell=True parameter identified, security issue

(S604)


526-526: f-string without any placeholders

Remove extraneous f prefix

(F541)


565-565: f-string without any placeholders

Remove extraneous f prefix

(F541)


573-573: Unused method argument: cmd_idx

(ARG002)


752-752: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/perf/test_perf.py

722-722: Unused method argument: working_dir

(ARG002)


808-808: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


869-869: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


2499-2499: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


2500-2500: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


2544-2544: Avoid specifying long messages outside the exception class

(TRY003)


2548-2548: Avoid specifying long messages outside the exception class

(TRY003)

🪛 Shellcheck (0.11.0)
jenkins/scripts/perf/disaggregated/install.sh

[warning] 4-4: resourcePathNode is referenced but not assigned.

(SC2154)


[warning] 7-7: jobWorkspace is referenced but not assigned.

(SC2154)


[warning] 9-9: llmTarfile is referenced but not assigned.

(SC2154)


[warning] 10-10: tarName is referenced but not assigned.

(SC2154)


[warning] 15-15: pytestCommand is referenced but not assigned.

(SC2154)


[warning] 23-23: stageName is referenced but not assigned.

(SC2154)

jenkins/scripts/perf/disaggregated/slurm_launch.sh

[warning] 18-18: configYamlFile appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 19-19: testListPathNode appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 50-50: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 104-104: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)


[error] 106-106: Argument mixes string and array. Use * or separate argument.

(SC2145)


[warning] 109-109: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 110-110: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[error] 112-112: Argument mixes string and array. Use * or separate argument.

(SC2145)


[error] 113-113: Argument mixes string and array. Use * or separate argument.

(SC2145)

jenkins/scripts/perf/disaggregated/slurm_run.sh

[warning] 5-5: rc is referenced but not assigned.

(SC2154)


[warning] 7-7: resourcePathNode is referenced but not assigned.

(SC2154)


[warning] 42-42: coverageConfigFile is referenced but not assigned.

(SC2154)


[warning] 77-77: perfMode is referenced but not assigned.

(SC2154)


[warning] 78-78: stageName is referenced but not assigned.

(SC2154)
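The shellcheck classes above (SC2154, SC2155, SC2207, SC2145) also have standard remediations. A minimal bash sketch with stand-in variable names and values, not the real CI contract between the Jenkins launcher and these scripts:

```shell
#!/bin/bash
set -Eeuo pipefail

# SC2154: variables exported by the calling SLURM/Jenkins script can be
# asserted up front, so a missing export fails loudly instead of
# expanding to an empty string under `set -u`.
stageName="${stageName:-demo-stage}"   # in CI this comes from the launcher
: "${stageName:?stageName must be exported by the caller}"

# SC2155: declare and assign separately so `local` does not mask the
# command substitution's exit status under `set -e`.
get_hostname() {
    local host
    host="$(hostname)"
    echo "$host"
}

# SC2207: use mapfile (or read -a) instead of arr=($(cmd)) word splitting.
mapfile -t nodes < <(printf '%s\n' node0 node1 node2)

# SC2145: expand arrays as "${arr[@]}" (separate words) or "${arr[*]}"
# (one joined string) rather than mixing a string with an unquoted array.
echo "node count: ${#nodes[@]}, list: ${nodes[*]}"
```

SC2034 (assigned but apparently unused) is typically resolved by exporting the variable when a child process reads it, or deleting the assignment when nothing does.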

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (25)
tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml (2)

3-36: LGTM: Clean test matrix expansion for 8-GPU configuration.

The expansion from a single entry to separate pre_merge and post_merge blocks with explicit conditions is well-structured and follows the existing pattern used for 4-GPU configurations below.


52-54: LGTM: Consistent test variant additions across stages.

The three new test variants (r1_fp4_v2 with different parallelism settings and gpt_oss_fp4 configurations) are consistently added to both pre_merge and post_merge stages, properly expanding FP4-based test coverage.

Also applies to: 71-73

tests/scripts/perf-sanity/l0_dgx_b200.yaml (1)

1-231: LGTM: Well-structured configuration modernization.

The restructuring introduces a consistent nested configuration schema across all seven server configurations. The standardization of client config naming (e.g., con1024_iter10_1k1k) and the migration to nested blocks (attention_dp_config, moe_config, cuda_graph_config, kv_cache_config, speculative_config) significantly improves maintainability and aligns with the broader schema modernization across the repository.

tests/scripts/perf-sanity/l0_gb200_multi_nodes.yaml (2)

1-66: LGTM: Multi-node configuration properly structured.

The two server configurations correctly specify gpus_per_node: 4 for multi-node distribution. The nested configuration structure is consistent with other perf-sanity YAMLs in this PR.


26-26: Verify the reduced memory fraction for multi-node scenarios.

Both configurations use free_gpu_memory_fraction: 0.5, which is lower than the 0.8 typically used in single-node configurations (e.g., l0_dgx_b200.yaml). Confirm this reduction is intentional for multi-node scenarios and accounts for additional memory overhead from inter-node communication.

Also applies to: 55-55

tests/integration/test_lists/test-db/perf_sanity_l0_gb200_multi_nodes.yml (1)

1-31: LGTM: Clean multi-node test configuration.

The test list properly configures multi-node GB200 testing with clear topology documentation (2 nodes × 4 GPUs). The structure is consistent with existing test list patterns, and both pre_merge and post_merge stages are appropriately configured.

jenkins/scripts/perf/disaggregated/install.sh (2)

1-23: Good error handling with set -Eeuo pipefail.

The script properly uses strict error handling, which will cause it to exit immediately on any command failure, undefined variable, or pipe error.


4-4: Environment variables are provided by the SLURM orchestration context.

Shellcheck warnings about undefined variables (resourcePathNode, jobWorkspace, llmTarfile, tarName, pytestCommand, stageName) can be safely ignored. These variables are exported by the calling SLURM launch script and are part of the documented execution contract for this node initialization script.

Also applies to: 7-7, 9-10, 15-15, 23-23

tests/scripts/perf-sanity/l0_dgx_b300.yaml (1)

1-66: LGTM: Configuration mirrors B200 structure for B300 platform.

The two server configurations follow the same nested schema as l0_dgx_b200.yaml, appropriately adapted for the B300 platform. The consistent structure across platform variants simplifies maintenance and reduces configuration drift.

tests/scripts/perf-sanity/l0_gb200_multi_nodes_disagg.yaml (2)

1-69: LGTM: Well-structured disaggregated configuration.

The disagg_configs structure properly separates hardware topology, timeout, benchmark parameters, and per-server (gen/ctx) configurations. The nested structure is consistent with other perf-sanity YAMLs in this PR.


46-48: Clarify the speculative_config conditional removal guidance.

The comments "when mtp_size is 0, remove this section" appear in both gen and ctx configurations. Verify whether this conditional logic should be handled automatically by the test infrastructure or if it requires manual configuration file editing for different test scenarios.

If this is meant to be handled automatically, consider documenting the logic in the test infrastructure code or adding validation that ensures the speculative_config is consistent with the mtp_size parameter.

Also applies to: 67-69
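If the infrastructure is to handle this automatically, one option is a small normalization step that drops or requires the section based on `mtp_size`. A hypothetical sketch — the key name `speculative_config` follows the comment's wording, but the real schema handling may differ:

```python
def normalize_speculative_config(cfg: dict, mtp_size: int) -> dict:
    """Drop speculative_config when mtp_size is 0; require it otherwise."""
    cfg = dict(cfg)  # copy so the caller's config is not mutated
    if mtp_size == 0:
        # the "when mtp_size is 0, remove this section" case
        cfg.pop("speculative_config", None)
    elif "speculative_config" not in cfg:
        raise ValueError("mtp_size > 0 requires a speculative_config section")
    return cfg
```

A validator like this would let the YAML keep a single template while the test infrastructure enforces consistency with `mtp_size`.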

jenkins/scripts/perf/disaggregated/slurm_exec.py (2)

1-190: LGTM: Well-structured SLURM orchestration with proper error handling.

The script provides comprehensive SLURM job orchestration with proper argument parsing, topology computation from YAML config, job submission, monitoring, and cleanup. The exit code handling and tail process management are correctly implemented.


103-103: shell=True usage is acceptable in this CI context.

While static analysis flags shell=True as a security concern, the usage here is appropriate for a CI orchestration script where:

  1. The sbatch command is constructed from validated arguments
  2. The script runs in a controlled Jenkins/SLURM environment
  3. The complexity of the multi-line sbatch command would be difficult to express without shell interpretation

The security risk is minimal in this context since all inputs are from trusted Jenkins configuration.

Also applies to: 139-139, 157-157
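The pattern under discussion can be sketched as follows. This is not the actual slurm_exec.py code — the flags and command text are illustrative — but it shows the shape: a shell-interpreted command string with any variable parts quoted via shlex, which keeps the residual risk low when inputs come from trusted CI configuration:

```python
import shlex
import subprocess


def submit_job(partition: str, nodes: int, script: str,
               dry_run: bool = True) -> str:
    # Build the sbatch invocation as one shell string; shlex.quote guards
    # the interpolated pieces even though CI inputs are trusted.
    cmd = (
        f"sbatch --partition={shlex.quote(partition)} "
        f"--nodes={nodes} "
        f"--parsable {shlex.quote(script)}"
    )
    if dry_run:
        return cmd  # inspect the command without touching SLURM
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()  # --parsable prints just the job id
```

With `dry_run=True` the function simply returns the composed command, which is also how such a wrapper can be unit-tested without a SLURM cluster.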

tests/integration/defs/perf/test_perf.py (12)

21-21: LGTM!

The socket import is appropriately used for hostname resolution in the multi-node disaggregated configuration.


318-351: LGTM!

The rename from SERVER_BENCHMARK_PERF_METRIC_LOG_QUERIES to AGGR_SERVER_PERF_METRIC_LOG_QUERIES clarifies that these metrics are specific to aggregated server testing, distinguishing them from the new multi-node disaggregated metrics.


588-601: LGTM!

The addition of hostname and port parameters to to_cmd() enables multi-node deployments while maintaining backward compatibility through sensible defaults.


603-681: LGTM!

The expanded to_db_data() method comprehensively captures all server configuration fields for database storage, including the new nested configuration sections.


683-698: LGTM!

Good defensive coding: creates a copy of the config data before modifying it and properly handles path conversion for speculative_model_dir.


706-719: LGTM!

The refactored ClientConfig constructor follows the same pattern as ServerConfig, using a dictionary for configuration data with appropriate defaults for optional fields.


1043-1049: LGTM!

The addition of disagg_configs to PerfTestConfig is well-documented and follows the same pattern as the aggregated mode configuration fields.


1078-1088: LGTM!

The naming convention clearly distinguishes aggregated server tests (using server: and client: prefixes) from multi-node disaggregated tests (using disagg: prefix).


1494-1508: LGTM!

The configuration setter methods follow a consistent pattern and properly delegate to the respective parsing functions.


1668-1723: Verify shared filesystem assumption for multi-node coordination.

The code writes hostname files to a shared hostnames directory (lines 1669-1671, 1683-1686) that all nodes must access. This assumes a shared filesystem across nodes. Verify this assumption is valid for the target deployment environment, or add documentation about this requirement.
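The coordination scheme described above can be sketched like this — each rank writes its hostname into a shared directory and a driver polls until all expected files appear. Paths, file names, and the single-process demo are illustrative assumptions, not the test_perf.py implementation:

```python
import socket
import tempfile
import time
from pathlib import Path


def publish_hostname(shared_dir: Path, rank: int) -> Path:
    # Each node writes hostname_<rank>.txt on the shared mount.
    shared_dir.mkdir(parents=True, exist_ok=True)
    path = shared_dir / f"hostname_{rank}.txt"
    path.write_text(socket.gethostname())
    return path


def wait_for_hostnames(shared_dir: Path, num_nodes: int,
                       timeout_s: float = 60.0) -> list[str]:
    # The driver polls the shared directory until every node has reported.
    deadline = time.time() + timeout_s
    files: list[Path] = []
    while time.time() < deadline:
        files = sorted(shared_dir.glob("hostname_*.txt"))
        if len(files) >= num_nodes:
            return [f.read_text().strip() for f in files]
        time.sleep(0.5)
    raise TimeoutError(f"only {len(files)} of {num_nodes} nodes reported")


# Single-process demo standing in for two nodes on a shared mount:
shared = Path(tempfile.mkdtemp()) / "hostnames"
publish_hostname(shared, 0)
publish_hostname(shared, 1)
print(wait_for_hostnames(shared, 2))
```

This only works if `shared_dir` really is visible from every node — which is exactly the assumption the comment asks to verify or document.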


2389-2409: LGTM!

Good design decision to reuse AGGR_SERVER_METRICS for multi-node disaggregated testing, as both modes measure similar performance characteristics.


1264-1270: The review comment is incorrect; no changes are needed.

The code at line 1267 uses "_multi_nodes_disagg" (with 's'), which correctly matches the actual config file naming in tests/scripts/perf-sanity/ (e.g., l0_gb200_multi_nodes_disagg.yaml). The self.runtime constant "multi_node_disagg_server" is a separate label that doesn't need to match the file naming pattern—it serves a different purpose. The code is correct as-is.

Likely an incorrect or invalid review comment.

@chenfeiz0326 chenfeiz0326 force-pushed the chenfeiz/support-disagg-perf-test branch 2 times, most recently from d896ea1 to c6c3fae Compare November 18, 2025 03:51
@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #24970 [ run ] triggered by Bot. Commit: 0c34d50

@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #25017 [ run ] triggered by Bot. Commit: 0c34d50

@tensorrt-cicd
Collaborator

PR_Github #25017 [ run ] completed with state SUCCESS. Commit: 0c34d50
/LLM/main/L0_MergeRequest_PR pipeline #18901 (Partly Tested) completed with status: 'SUCCESS'

@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #25444 [ run ] triggered by Bot. Commit: b38b874

@tensorrt-cicd
Collaborator

PR_Github #25444 [ run ] completed with state SUCCESS. Commit: b38b874
/LLM/main/L0_MergeRequest_PR pipeline #19258 (Partly Tested) completed with status: 'SUCCESS'

@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #25475 [ run ] triggered by Bot. Commit: 86a0704

@tensorrt-cicd
Collaborator

PR_Github #25475 [ run ] completed with state FAILURE. Commit: 86a0704
/LLM/main/L0_MergeRequest_PR pipeline #19289 (Partly Tested) completed with status: 'FAILURE'

@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Disagg-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #25509 [ run ] triggered by Bot. Commit: 1bc66fa

@tensorrt-cicd
Collaborator

PR_Github #25509 [ run ] completed with state FAILURE. Commit: 1bc66fa
/LLM/main/L0_MergeRequest_PR pipeline #19316 (Partly Tested) completed with status: 'FAILURE'

@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Disagg-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #25522 [ run ] triggered by Bot. Commit: d6c7bd1

@tensorrt-cicd
Collaborator

PR_Github #25522 [ run ] completed with state FAILURE. Commit: d6c7bd1
/LLM/main/L0_MergeRequest_PR pipeline #19327 (Partly Tested) completed with status: 'FAILURE'

@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Disagg-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #25645 [ run ] triggered by Bot. Commit: 74f1371

@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #29966 [ run ] triggered by Bot. Commit: 875e510

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326 chenfeiz0326 force-pushed the chenfeiz/support-disagg-perf-test branch from 875e510 to 97e32b7 Compare December 25, 2025 14:37
@chenfeiz0326
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #29966 [ run ] completed with state SUCCESS. Commit: 875e510
/LLM/main/L0_MergeRequest_PR pipeline #23050 (Partly Tested) completed with status: 'SUCCESS'

@tensorrt-cicd
Collaborator

PR_Github #29970 [ run ] triggered by Bot. Commit: 97e32b7

@tensorrt-cicd
Collaborator

PR_Github #29970 [ run ] completed with state SUCCESS. Commit: 97e32b7
/LLM/main/L0_MergeRequest_PR pipeline #23053 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1,GB200-12_GPUs-3_Nodes-PyTorch-Perf-Sanity-Disagg-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #29981 [ run ] triggered by Bot. Commit: fbfcb75

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@tensorrt-cicd
Collaborator

PR_Github #29981 [ run ] completed with state SUCCESS. Commit: fbfcb75
/LLM/main/L0_MergeRequest_PR pipeline #23062 (Partly Tested) completed with status: 'SUCCESS'

@chenfeiz0326
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-2"

@tensorrt-cicd
Collaborator

PR_Github #29983 [ run ] triggered by Bot. Commit: fbfcb75

@tensorrt-cicd
Collaborator

PR_Github #29984 [ run ] triggered by Bot. Commit: fbfcb75

@chenfeiz0326
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #29984 [ run ] completed with state SUCCESS. Commit: fbfcb75
/LLM/main/L0_MergeRequest_PR pipeline #23065 (Partly Tested) completed with status: 'SUCCESS'

@tensorrt-cicd
Collaborator

PR_Github #29988 [ run ] triggered by Bot. Commit: 1222715

@tensorrt-cicd
Collaborator

PR_Github #29988 [ run ] completed with state SUCCESS. Commit: 1222715
/LLM/main/L0_MergeRequest_PR pipeline #23069 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@chenfeiz0326
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30003 [ run ] triggered by Bot. Commit: 1222715

@tensorrt-cicd
Collaborator

PR_Github #30003 [ run ] completed with state SUCCESS. Commit: 1222715
/LLM/main/L0_MergeRequest_PR pipeline #23081 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@chenfeiz0326
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30005 [ run ] triggered by Bot. Commit: 1222715

@tensorrt-cicd
Collaborator

PR_Github #30005 [ run ] completed with state SUCCESS. Commit: 1222715
/LLM/main/L0_MergeRequest_PR pipeline #23083 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@chenfeiz0326 chenfeiz0326 merged commit d70aedd into NVIDIA:main Dec 26, 2025
5 checks passed
videodanchik pushed a commit to videodanchik/TensorRT-LLM that referenced this pull request Jan 14, 2026
…#9138)

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
